葫芦的运维日志

2025/12/25 15:32

The Complete Guide to Service Permission Management in AWS EKS

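Services running in EKS need access to AWS resources such as S3, DynamoDB, and SQS. How should they be granted permissions? The question looks simple, but there are several options, and each differs in security boundary, operational complexity, and suitable use cases. A wrong choice leaves a security risk from overly broad permissions at best, and a service that cannot run at worst. This article walks through every option from coarse-grained to fine-grained and gives the full AWS-side and K8s-side configuration for each.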

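1. Overview of the Permission Options

Option | Granularity | Security | Complexity | Recommendation
Node Group IAM Role | Node level (shared by all Pods) | | | ⭐⭐
IRSA (IAM Roles for Service Accounts) | ServiceAccount level | | | ⭐⭐⭐⭐⭐
EKS Pod Identity | ServiceAccount level | | | ⭐⭐⭐⭐⭐
kube2iam / kiam | Pod level (annotation) | | | ⭐⭐ (obsolete)
Hard-coded Access Key | Container level | Very low | | ⭐ (forbidden)

The conclusion up front: for new projects from 2024 onward, prefer EKS Pod Identity; for existing projects, use IRSA; use the Node Group Role only for permissions the node itself needs (such as pulling images from ECR); never use Access Keys.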

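2. Option 1: Node Group IAM Role (node-level permissions)

2.1 How it works

Each Node Group is associated with one IAM Role, and every Pod on those nodes can obtain temporary credentials for that Role through the EC2 Instance Metadata Service (IMDS). This is the simplest and also the bluntest approach.

EC2 Node (Node Group), IAM Role: eks-node-role
  Pod A (needs S3)
  Pod B (needs SQS)
⚠ Both Pods can access S3 and SQS (excessive permissions)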
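
To make the IMDS path concrete, here is a minimal Python sketch of the two HTTP requests a process uses to fetch the node Role's credentials with IMDSv2. It only builds the requests and never sends them; the helper names `token_request` and `credentials_request` are made up for this illustration, while the endpoint paths and header names are the standard IMDSv2 ones.

```python
import urllib.request

IMDS = "http://169.254.169.254"  # link-local IMDS address


def token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """IMDSv2 step 1: PUT a request for a session token."""
    return urllib.request.Request(
        f"{IMDS}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )


def credentials_request(token: str, role_name: str) -> urllib.request.Request:
    """IMDSv2 step 2: GET the Role's temporary credentials using the token."""
    return urllib.request.Request(
        f"{IMDS}/latest/meta-data/iam/security-credentials/{role_name}",
        headers={"X-aws-ec2-metadata-token": token},
    )


# Build (but do not send) both requests for the node role in the diagram above.
step1 = token_request()
step2 = credentials_request("example-token", "eks-node-role")
print(step1.get_method(), step1.full_url)
print(step2.get_method(), step2.full_url)
```

Nothing in these requests identifies the calling Pod, which is why every Pod that can reach IMDS ends up with the same Role.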

2.2 Terraform configuration: IAM Role

# IAM Role for the Node Group
resource "aws_iam_role" "eks_node" {
  name = "${var.cluster_name}-node-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ec2.amazonaws.com"
      }
    }]
  })
}

# Base policies required by EKS nodes (mandatory)
resource "aws_iam_role_policy_attachment" "node_AmazonEKSWorkerNodePolicy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
  role       = aws_iam_role.eks_node.name
}

resource "aws_iam_role_policy_attachment" "node_AmazonEKS_CNI_Policy" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
  role       = aws_iam_role.eks_node.name
}

resource "aws_iam_role_policy_attachment" "node_AmazonEC2ContainerRegistryReadOnly" {
  policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
  role       = aws_iam_role.eks_node.name
}

# Granting business permissions through the Node Group Role (not recommended, but sometimes unavoidable)
resource "aws_iam_role_policy" "node_s3_access" {
  name = "s3-access"
  role = aws_iam_role.eks_node.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ]
      Resource = [
        "arn:aws:s3:::my-app-bucket",
        "arn:aws:s3:::my-app-bucket/*"
      ]
    }]
  })
}

# Node Group
resource "aws_eks_node_group" "main" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "main"
  node_role_arn   = aws_iam_role.eks_node.arn
  subnet_ids      = var.private_subnet_ids

  scaling_config {
    desired_size = 3
    max_size     = 10
    min_size     = 1
  }

  instance_types = ["m5.large"]
}

2.3 Deployment configuration

With the Node Group Role, the Deployment needs no special configuration; Pods automatically inherit the node's IAM Role:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: 123456789.dkr.ecr.ap-southeast-1.amazonaws.com/my-app:latest
        # No IAM-related configuration is needed
        # The Pod obtains the node's IAM Role credentials through IMDS automatically
        env:
        - name: AWS_DEFAULT_REGION
          value: "ap-southeast-1"

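2.4 Problems and risks

  • Permission sprawl: all Pods on a node share the same Role. If Pod A needs S3 and Pod B needs DynamoDB, the Node Role must carry both S3 and DynamoDB permissions, so Pod A can access DynamoDB too
  • Lateral movement risk: an attacker who compromises any one Pod gains every AWS permission granted to the node
  • IMDS attacks: a Pod can reach 169.254.169.254 directly to obtain credentials. IMDSv2 mitigates part of the risk, but the underlying problem remains

When should you use the Node Group Role? Only for permissions the node itself needs to run: pulling ECR images, the CNI network plugin, CloudWatch logs. Do not put business permissions here.

2.5 Restricting IMDS access (security hardening)

If you grant Pod permissions through IRSA or Pod Identity, you should block Pods from reaching IMDS so they cannot "steal" the node's Role:

# Node Group launch template: enforce IMDSv2 + limit hops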
resource "aws_launch_template" "eks_node" {
  name_prefix = "${var.cluster_name}-node-"

  metadata_options {
    http_endpoint               = "enabled"
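    http_tokens                 = "required"  # enforce IMDSv2
    http_put_response_hop_limit = 1           # key setting: with 1, containers on the Pod network cannot reach IMDS
  }

  # ... other settings (the node group must reference this template through its launch_template block)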
}

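hop_limit = 1 means an IMDS response can only reach a process on the EC2 instance itself; a request from a container on the Pod network (one extra hop) never gets its response. Pods that set hostNetwork: true share the node's network namespace and can still reach IMDS. This is a security best practice when using IRSA/Pod Identity.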
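
The hop-limit behaviour can be pictured with a toy model, sketched here in Python. It assumes the response leaves IMDS with a TTL equal to the hop limit and that routing into a Pod's network namespace costs exactly one hop; the function name and the two constants are illustrative only.

```python
def imds_response_arrives(hop_limit: int, forwarding_hops: int) -> bool:
    """Toy model: the response starts with TTL == hop_limit, every forwarding
    hop decrements it, and the packet is dropped once the TTL reaches 0."""
    return hop_limit - forwarding_hops >= 1


HOST_PROCESS = 0  # kubelet, or a Pod running with hostNetwork: true
POD_NETWORK = 1   # container behind the node's virtual network (assumed one hop)

for limit in (1, 2):
    print(
        f"hop_limit={limit}: host={imds_response_arrives(limit, HOST_PROCESS)}, "
        f"pod={imds_response_arrives(limit, POD_NETWORK)}"
    )
```

With a limit of 1 only host-level processes get an answer; raising it to 2 re-opens IMDS to ordinary Pods.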

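3. Option 2: IRSA, IAM Roles for Service Accounts (recommended)

3.1 How it works

IRSA is the Pod-level permission scheme officially recommended by AWS. The core idea:

  1. The EKS cluster has an OIDC Provider
  2. Create an IAM Role whose trust policy allows only a specific ServiceAccount in a specific namespace to assume it
  3. Add an annotation on the K8s ServiceAccount that points to this IAM Role
  4. Once a Pod uses this ServiceAccount, the AWS SDK automatically obtains temporary credentials for the Role through STS

EC2 Node
  Pod A (SA: s3-reader, Role: S3ReadRole) → can only read S3
  Pod B (SA: sqs-writer, Role: SQSRole) → can only write to SQS
✅ Each Pod has its own least-privilege permissions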
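
As a rough illustration of step 2, the following Python sketch assembles the trust policy document for one ServiceAccount. The function name is invented for this example, and the issuer ID `EXAMPLE` and the account number are placeholders, not real values.

```python
import json


def irsa_trust_policy(oidc_provider_arn, issuer_url, namespace, service_account):
    """Build an IRSA trust policy scoped to a single ServiceAccount."""
    issuer = issuer_url.replace("https://", "", 1)  # condition keys omit the scheme
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": oidc_provider_arn},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                f"{issuer}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                f"{issuer}:aud": "sts.amazonaws.com",
            }},
        }],
    }


# Placeholder issuer and account number, for illustration only.
ISSUER_URL = "https://oidc.eks.ap-southeast-1.amazonaws.com/id/EXAMPLE"
PROVIDER_ARN = (
    "arn:aws:iam::123456789:oidc-provider/"
    "oidc.eks.ap-southeast-1.amazonaws.com/id/EXAMPLE"
)

policy = irsa_trust_policy(PROVIDER_ARN, ISSUER_URL, "production", "order-service")
print(json.dumps(policy, indent=2))
```

The `sub` condition is what pins the Role to one namespace and one ServiceAccount; dropping it would let any ServiceAccount in the cluster assume the Role.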

3.2 Step 1: Create the OIDC Provider

# Get the EKS cluster's OIDC information
data "aws_eks_cluster" "main" {
  name = var.cluster_name
}

data "tls_certificate" "eks" {
  url = data.aws_eks_cluster.main.identity[0].oidc[0].issuer
}

# Create the OIDC Provider (only once per cluster)
resource "aws_iam_openid_connect_provider" "eks" {
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.eks.certificates[0].sha1_fingerprint]
  url             = data.aws_eks_cluster.main.identity[0].oidc[0].issuer

  tags = {
    Cluster = var.cluster_name
  }
}

3.3 Step 2: Create the IAM Role (with an OIDC trust policy)

locals {
  oidc_provider_arn = aws_iam_openid_connect_provider.eks.arn
  oidc_provider_url = replace(
    data.aws_eks_cluster.main.identity[0].oidc[0].issuer, 
    "https://", ""
  )
}

# Dedicated IAM Role for order-service
resource "aws_iam_role" "order_service" {
  name = "${var.cluster_name}-order-service"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = local.oidc_provider_arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          # Restriction: only the order-service SA in the production namespace can assume this Role
          "${local.oidc_provider_url}:sub" = "system:serviceaccount:production:order-service"
          "${local.oidc_provider_url}:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })
}

# Grant specific permissions: only the order-related DynamoDB table and SQS queue
resource "aws_iam_role_policy" "order_service" {
  name = "order-service-policy"
  role = aws_iam_role.order_service.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "dynamodb:GetItem",
          "dynamodb:PutItem",
          "dynamodb:UpdateItem",
          "dynamodb:Query"
        ]
        Resource = [
          "arn:aws:dynamodb:ap-southeast-1:123456789:table/orders",
          "arn:aws:dynamodb:ap-southeast-1:123456789:table/orders/index/*"
        ]
      },
      {
        Effect = "Allow"
        Action = [
          "sqs:SendMessage",
          "sqs:ReceiveMessage",
          "sqs:DeleteMessage"
        ]
        Resource = "arn:aws:sqs:ap-southeast-1:123456789:order-events"
      }
    ]
  })
}

3.4 Step 3: Create the K8s ServiceAccount

# service-account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: order-service
  namespace: production
  annotations:
    # Key: this annotation links the SA to the IAM Role
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/my-cluster-order-service
  labels:
    app: order-service

3.5 Step 4: Use the ServiceAccount in the Deployment

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      serviceAccountName: order-service  # key: specify the ServiceAccount
      containers:
      - name: order-service
        image: 123456789.dkr.ecr.ap-southeast-1.amazonaws.com/order-service:v1.2.0
        ports:
        - containerPort: 8080
        env:
        - name: AWS_DEFAULT_REGION
          value: "ap-southeast-1"
        # No need to set AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY
        # IRSA automatically injects these environment variables:
        #   AWS_ROLE_ARN
        #   AWS_WEB_IDENTITY_TOKEN_FILE
        resources:
          requests:
            cpu: 100m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi

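How IRSA works: the EKS Mutating Webhook automatically injects a projected volume (containing a JWT token) and two environment variables into the Pod. When the AWS SDK detects these environment variables, it automatically calls STS AssumeRoleWithWebIdentity to obtain temporary credentials. The whole process is transparent to application code.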

3.6 Verifying that IRSA works

# Check the environment variables inside the Pod
kubectl exec -it deploy/order-service -n production -- env | grep AWS

# Expected output:
# AWS_ROLE_ARN=arn:aws:iam::123456789:role/my-cluster-order-service
# AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
# AWS_DEFAULT_REGION=ap-southeast-1

# Check the Pod's actual identity (requires the AWS CLI in the container image)
kubectl exec -it deploy/order-service -n production -- \
  aws sts get-caller-identity

# Should return the IRSA Role, not the Node Role:
# {
#   "UserId": "AROA...:botocore-session-...",
#   "Account": "123456789",
#   "Arn": "arn:aws:sts::123456789:assumed-role/my-cluster-order-service/..."
# }

# Check that the projected volume is mounted
kubectl get pod -n production -l app=order-service -o yaml | grep -A5 "projected"

3.7 Packaging IRSA as a Terraform module

With many services, writing this out for each one gets tedious. Wrap it in a module:

# modules/irsa/main.tf
variable "cluster_name" {}
variable "oidc_provider_arn" {}
variable "oidc_provider_url" {}
variable "namespace" {}
variable "service_account_name" {}
variable "policy_json" {}

resource "aws_iam_role" "this" {
  name = "${var.cluster_name}-${var.namespace}-${var.service_account_name}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = var.oidc_provider_arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "${var.oidc_provider_url}:sub" = "system:serviceaccount:${var.namespace}:${var.service_account_name}"
          "${var.oidc_provider_url}:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })
}

resource "aws_iam_role_policy" "this" {
  name   = "${var.service_account_name}-policy"
  role   = aws_iam_role.this.id
  policy = var.policy_json
}

output "role_arn" {
  value = aws_iam_role.this.arn
}

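# ============================================
# Calling the module: one block per service
# (the wildcard actions below are shortened for illustration; section 7.1 shows the least-privilege form)
# ============================================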

module "irsa_order_service" {
  source               = "./modules/irsa"
  cluster_name         = var.cluster_name
  oidc_provider_arn    = aws_iam_openid_connect_provider.eks.arn
  oidc_provider_url    = local.oidc_provider_url
  namespace            = "production"
  service_account_name = "order-service"
  policy_json = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["dynamodb:*"]
      Resource = "arn:aws:dynamodb:*:*:table/orders*"
    }]
  })
}

module "irsa_payment_service" {
  source               = "./modules/irsa"
  cluster_name         = var.cluster_name
  oidc_provider_arn    = aws_iam_openid_connect_provider.eks.arn
  oidc_provider_url    = local.oidc_provider_url
  namespace            = "production"
  service_account_name = "payment-service"
  policy_json = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["sqs:*"]
        Resource = "arn:aws:sqs:*:*:payment-*"
      },
      {
        Effect   = "Allow"
        Action   = ["kms:Decrypt", "kms:GenerateDataKey"]
        Resource = "arn:aws:kms:ap-southeast-1:123456789:key/xxx"
      }
    ]
  })
}

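4. Option 3: EKS Pod Identity (newest recommendation)

4.1 How it works

EKS Pod Identity is a newer scheme launched at the end of 2023, aimed at simplifying IRSA configuration. The key differences:

  • No OIDC Provider needed: there is nothing to create or manage
  • Simpler trust policy: no OIDC URL in the IAM Role's trust policy
  • Associated through the EKS API: the aws_eks_pod_identity_association resource links the Role to the ServiceAccount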
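
The practical effect of these differences shows up when the same Role is used from two clusters. In this small Python sketch (function names invented, issuer IDs `AAAA` and `BBBB` and the account number are placeholders), the IRSA principal changes per cluster while the Pod Identity statement stays identical.

```python
def irsa_principal(account_id: str, issuer_host_path: str) -> dict:
    """IRSA trusts one cluster's OIDC provider, so the principal is cluster-specific."""
    return {"Federated": f"arn:aws:iam::{account_id}:oidc-provider/{issuer_host_path}"}


def pod_identity_statement() -> dict:
    """Pod Identity trusts a fixed service principal, independent of the cluster."""
    return {
        "Effect": "Allow",
        "Principal": {"Service": "pods.eks.amazonaws.com"},
        "Action": ["sts:AssumeRole", "sts:TagSession"],
    }


# Placeholder issuer IDs for two clusters in the same account.
cluster_a = irsa_principal("123456789", "oidc.eks.ap-southeast-1.amazonaws.com/id/AAAA")
cluster_b = irsa_principal("123456789", "oidc.eks.ap-southeast-1.amazonaws.com/id/BBBB")

print("IRSA principals differ:", cluster_a != cluster_b)
print("Pod Identity statement:", pod_identity_statement())
```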

4.2 Step 1: Install the Pod Identity Agent

# EKS Pod Identity Agent Add-on
resource "aws_eks_addon" "pod_identity_agent" {
  cluster_name = aws_eks_cluster.main.name
  addon_name   = "eks-pod-identity-agent"
  
  # How to resolve configuration conflicts when the add-on is updated
  resolve_conflicts_on_update = "OVERWRITE"
}

4.3 Step 2: Create the IAM Role

# Note: the trust policy is much simpler than with IRSA
resource "aws_iam_role" "notification_service" {
  name = "${var.cluster_name}-notification-service"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Service = "pods.eks.amazonaws.com"  # fixed value, no OIDC URL needed
      }
      Action = [
        "sts:AssumeRole",
        "sts:TagSession"  # required by Pod Identity
      ]
    }]
  })
}

# Permission policy
resource "aws_iam_role_policy" "notification_service" {
  name = "notification-policy"
  role = aws_iam_role.notification_service.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "sns:Publish"
        ]
        Resource = "arn:aws:sns:ap-southeast-1:123456789:user-notifications"
      },
      {
        Effect = "Allow"
        Action = [
          "ses:SendEmail",
          "ses:SendRawEmail"
        ]
        Resource = "*"
        Condition = {
          StringEquals = {
            "ses:FromAddress" = "[email]"
          }
        }
      }
    ]
  })
}

4.4 Step 3: Create the Pod Identity Association

# This step replaces the ServiceAccount annotation used by IRSA
resource "aws_eks_pod_identity_association" "notification_service" {
  cluster_name    = aws_eks_cluster.main.name
  namespace       = "production"
  service_account = "notification-service"
  role_arn        = aws_iam_role.notification_service.arn
}

4.5 K8s-side configuration

# service-account.yaml (note: no annotation needed)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: notification-service
  namespace: production
  # No eks.amazonaws.com/role-arn annotation needed
  # Pod Identity is associated through the AWS API and does not rely on K8s annotations

---
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: notification-service
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: notification-service
  template:
    metadata:
      labels:
        app: notification-service
    spec:
      serviceAccountName: notification-service
      containers:
      - name: notification-service
        image: 123456789.dkr.ecr.ap-southeast-1.amazonaws.com/notification:v2.0
        ports:
        - containerPort: 8080
        env:
        - name: AWS_DEFAULT_REGION
          value: "ap-southeast-1"
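
When debugging, it helps to tell from inside the container which mechanism supplied the credentials. The Python sketch below inspects environment variables only; it is a simplification and not the SDK's real provider chain. The IRSA variable names come from section 3.5; the two AWS_CONTAINER_* names are the ones Pod Identity is documented to inject, so verify them against your own cluster.

```python
def credential_source(env: dict) -> str:
    """Guess the credential mechanism from the injected environment variables."""
    if "AWS_ROLE_ARN" in env and "AWS_WEB_IDENTITY_TOKEN_FILE" in env:
        return "irsa"
    if (
        "AWS_CONTAINER_CREDENTIALS_FULL_URI" in env
        and "AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE" in env
    ):
        return "pod-identity"
    return "node-role-or-none"  # falls back to IMDS, if it is reachable at all


# Example environments (values are placeholders).
irsa_env = {
    "AWS_ROLE_ARN": "arn:aws:iam::123456789:role/my-cluster-order-service",
    "AWS_WEB_IDENTITY_TOKEN_FILE": "/var/run/secrets/eks.amazonaws.com/serviceaccount/token",
}
pod_identity_env = {
    "AWS_CONTAINER_CREDENTIALS_FULL_URI": "http://169.254.170.23/v1/credentials",
    "AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE": "/path/to/eks-pod-identity-token",
}

print(credential_source(irsa_env))
print(credential_source(pod_identity_env))
print(credential_source({"AWS_DEFAULT_REGION": "ap-southeast-1"}))
```

In a running container the function can be called with `dict(os.environ)`.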

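4.6 IRSA vs Pod Identity

Dimension | IRSA | Pod Identity
OIDC Provider | Must be created and managed | Not needed
IAM trust policy | Contains the OIDC URL (cluster-specific) | Fixed pods.eks.amazonaws.com
Association method | ServiceAccount annotation | aws_eks_pod_identity_association
Reusing a Role across clusters | Not without editing the trust policy (it is bound to the cluster's OIDC provider) | Yes (the trust policy contains no cluster information)
Fargate support | Supported | Not supported (as of early 2025)
Minimum EKS version | 1.13+ | 1.24+
SDK requirement | A reasonably recent version (web identity support) | A newer version (Pod Identity support)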

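5. In practice: complete configuration for multiple services in one cluster

Suppose your EKS cluster runs 4 services, each needing different AWS permissions:

Service | Namespace | AWS permissions needed
order-service | production | DynamoDB (orders table) + SQS (order-events)
payment-service | production | SQS (payment-queue) + KMS (decrypt)
image-processor | production | S3 (upload/download images) + Rekognition
log-shipper | kube-system | CloudWatch Logs + Kinesis Firehose

5.1 Complete Terraform configuration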

# ============================================
# All IRSA Role definitions
# ============================================

# Account ID used to build the ARNs below
data "aws_caller_identity" "current" {}

# 1. order-service
module "irsa_order" {
  source               = "./modules/irsa"
  cluster_name         = var.cluster_name
  oidc_provider_arn    = aws_iam_openid_connect_provider.eks.arn
  oidc_provider_url    = local.oidc_provider_url
  namespace            = "production"
  service_account_name = "order-service"
  policy_json = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "DynamoDBAccess"
        Effect   = "Allow"
        Action   = ["dynamodb:GetItem", "dynamodb:PutItem", 
                    "dynamodb:UpdateItem", "dynamodb:Query", 
                    "dynamodb:BatchGetItem"]
        Resource = [
          "arn:aws:dynamodb:${var.region}:${data.aws_caller_identity.current.account_id}:table/orders",
          "arn:aws:dynamodb:${var.region}:${data.aws_caller_identity.current.account_id}:table/orders/index/*"
        ]
      },
      {
        Sid      = "SQSAccess"
        Effect   = "Allow"
        Action   = ["sqs:SendMessage"]
        Resource = "arn:aws:sqs:${var.region}:${data.aws_caller_identity.current.account_id}:order-events"
      }
    ]
  })
}

# 2. payment-service
module "irsa_payment" {
  source               = "./modules/irsa"
  cluster_name         = var.cluster_name
  oidc_provider_arn    = aws_iam_openid_connect_provider.eks.arn
  oidc_provider_url    = local.oidc_provider_url
  namespace            = "production"
  service_account_name = "payment-service"
  policy_json = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "SQSAccess"
        Effect   = "Allow"
        Action   = ["sqs:ReceiveMessage", "sqs:DeleteMessage", 
                    "sqs:GetQueueAttributes"]
        Resource = "arn:aws:sqs:${var.region}:${data.aws_caller_identity.current.account_id}:payment-*"
      },
      {
        Sid      = "KMSDecrypt"
        Effect   = "Allow"
        Action   = ["kms:Decrypt"]
        Resource = aws_kms_key.payment.arn
      }
    ]
  })
}

# 3. image-processor
module "irsa_image" {
  source               = "./modules/irsa"
  cluster_name         = var.cluster_name
  oidc_provider_arn    = aws_iam_openid_connect_provider.eks.arn
  oidc_provider_url    = local.oidc_provider_url
  namespace            = "production"
  service_account_name = "image-processor"
  policy_json = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "S3Access"
        Effect = "Allow"
        Action = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
        Resource = "arn:aws:s3:::${var.image_bucket}/*"
      },
      {
        Sid    = "S3ListBucket"
        Effect = "Allow"
        Action = ["s3:ListBucket"]
        Resource = "arn:aws:s3:::${var.image_bucket}"
      },
      {
        Sid    = "RekognitionAccess"
        Effect = "Allow"
        Action = ["rekognition:DetectLabels", "rekognition:DetectFaces",
                  "rekognition:DetectModerationLabels"]
        Resource = "*"
      }
    ]
  })
}

# 4. log-shipper (kube-system namespace)
module "irsa_log_shipper" {
  source               = "./modules/irsa"
  cluster_name         = var.cluster_name
  oidc_provider_arn    = aws_iam_openid_connect_provider.eks.arn
  oidc_provider_url    = local.oidc_provider_url
  namespace            = "kube-system"
  service_account_name = "log-shipper"
  policy_json = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents",
          "logs:DescribeLogStreams"
        ]
        Resource = "arn:aws:logs:${var.region}:${data.aws_caller_identity.current.account_id}:log-group:/eks/${var.cluster_name}/*"
      },
      {
        Effect   = "Allow"
        Action   = ["firehose:PutRecord", "firehose:PutRecordBatch"]
        Resource = "arn:aws:firehose:${var.region}:${data.aws_caller_identity.current.account_id}:deliverystream/eks-logs"
      }
    ]
  })
}
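
The ServiceAccount annotations have to match the Role names this module generates (`cluster-namespace-serviceaccount`), and a mismatch is easy to introduce by hand. A small Python sketch, assuming the cluster is named `my-cluster` and using the placeholder account number from this article, derives the annotation from the same naming rule; the function names are invented for the example.

```python
MAX_ROLE_NAME = 64  # IAM role names are limited to 64 characters


def irsa_role_name(cluster: str, namespace: str, service_account: str) -> str:
    """Mirror the module's naming rule: <cluster>-<namespace>-<service account>."""
    name = f"{cluster}-{namespace}-{service_account}"
    if len(name) > MAX_ROLE_NAME:
        raise ValueError(f"role name {name!r} exceeds {MAX_ROLE_NAME} characters")
    return name


def sa_annotation(account_id: str, cluster: str, namespace: str, service_account: str) -> dict:
    """Return the annotation the ServiceAccount needs for this Role."""
    role = irsa_role_name(cluster, namespace, service_account)
    return {"eks.amazonaws.com/role-arn": f"arn:aws:iam::{account_id}:role/{role}"}


for ns, sa in [
    ("production", "order-service"),
    ("production", "payment-service"),
    ("production", "image-processor"),
    ("kube-system", "log-shipper"),
]:
    print(sa_annotation("123456789", "my-cluster", ns, sa))
```

Long cluster or namespace names can push the generated name past the IAM limit, which is why the sketch checks the length explicitly.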

5.2 Complete K8s YAML

# All ServiceAccounts
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: order-service
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/my-cluster-production-order-service
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-service
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/my-cluster-production-payment-service
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: image-processor
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/my-cluster-production-image-processor
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: log-shipper
  namespace: kube-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/my-cluster-kube-system-log-shipper
---
# order-service Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      serviceAccountName: order-service
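      # Mounts the ServiceAccount's Kubernetes API token (the default). If the app never
      # calls the Kubernetes API, false is the hardened choice; confirm afterwards that
      # the IRSA token volume is still injected.
      automountServiceAccountToken: true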
      containers:
      - name: order-service
        image: 123456789.dkr.ecr.ap-southeast-1.amazonaws.com/order-service:v1.2.0
        ports:
        - containerPort: 8080
        env:
        - name: AWS_DEFAULT_REGION
          value: "ap-southeast-1"
        - name: DYNAMODB_TABLE
          value: "orders"
        - name: SQS_QUEUE_URL
          value: "https://sqs.ap-southeast-1.amazonaws.com/123456789/order-events"
        securityContext:
          runAsNonRoot: true
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: "1"
            memory: 512Mi
---
# image-processor Deployment (needs more resources)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: image-processor
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: image-processor
  template:
    metadata:
      labels:
        app: image-processor
    spec:
      serviceAccountName: image-processor
      containers:
      - name: image-processor
        image: 123456789.dkr.ecr.ap-southeast-1.amazonaws.com/image-processor:v3.1
        env:
        - name: AWS_DEFAULT_REGION
          value: "ap-southeast-1"
        - name: S3_BUCKET
          value: "my-app-images"
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: "2"
            memory: 2Gi

6. Handling special scenarios

6.1 Using IRSA from a CronJob

# A CronJob can use IRSA too; the configuration is the same
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-report
  namespace: production
spec:
  schedule: "0 2 * * *"  # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: report-generator  # use a dedicated SA
          containers:
          - name: report
            image: 123456789.dkr.ecr.ap-southeast-1.amazonaws.com/report:v1
            command: ["python", "generate_report.py"]
            env:
            - name: AWS_DEFAULT_REGION
              value: "ap-southeast-1"
          restartPolicy: OnFailure

6.2 Sharing one ServiceAccount across multiple Deployments

If several Deployments need the same AWS permissions, they can share one ServiceAccount:

# Shared SA: api-gateway and api-worker both need access to the same S3 bucket
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-shared
  namespace: production
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/my-cluster-api-shared
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
  namespace: production
spec:
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      serviceAccountName: api-shared  # shared
      containers:
      - name: api-gateway
        image: 123456789.dkr.ecr.ap-southeast-1.amazonaws.com/api-gateway:v1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-worker
  namespace: production
spec:
  selector:
    matchLabels:
      app: api-worker
  template:
    metadata:
      labels:
        app: api-worker
    spec:
      serviceAccountName: api-shared  # shares the same SA
      containers:
      - name: api-worker
        image: 123456789.dkr.ecr.ap-southeast-1.amazonaws.com/api-worker:v1

6.3 Cross-account access

The service runs in EKS in account A but needs to access S3 in account B:

# Account A: the IRSA Role is allowed to assume the Role in account B
module "irsa_cross_account" {
  source               = "./modules/irsa"
  cluster_name         = var.cluster_name
  oidc_provider_arn    = aws_iam_openid_connect_provider.eks.arn
  oidc_provider_url    = local.oidc_provider_url
  namespace            = "production"
  service_account_name = "data-sync"
  policy_json = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "sts:AssumeRole"
      Resource = "arn:aws:iam::999888777:role/cross-account-s3-access"
    }]
  })
}

# Account B: create the Role to be assumed
resource "aws_iam_role" "cross_account_s3" {
  provider = aws.account_b
  name     = "cross-account-s3-access"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = {
        AWS = "arn:aws:iam::123456789:role/my-cluster-production-data-sync"
      }
      Action = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy" "cross_account_s3" {
  provider = aws.account_b
  role     = aws_iam_role.cross_account_s3.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["s3:GetObject", "s3:ListBucket"]
      Resource = ["arn:aws:s3:::account-b-data", "arn:aws:s3:::account-b-data/*"]
    }]
  })
}

# Application code: assume the Role in account B first, then access S3
import boto3

# IRSA automatically provides the credentials for account A
sts = boto3.client('sts')

# Assume the Role in account B
assumed = sts.assume_role(
    RoleArn='arn:aws:iam::999888777:role/cross-account-s3-access',
    RoleSessionName='data-sync'
)

# Use account B's temporary credentials to access S3
s3 = boto3.client('s3',
    aws_access_key_id=assumed['Credentials']['AccessKeyId'],
    aws_secret_access_key=assumed['Credentials']['SecretAccessKey'],
    aws_session_token=assumed['Credentials']['SessionToken']
)

data = s3.get_object(Bucket='account-b-data', Key='export/latest.csv')
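
The assumed credentials expire (one hour by default for AssumeRole), so a long-running service has to re-assume the Role before that. Below is a minimal sketch of two helpers that work on the response dictionary; it runs against a fabricated response, so boto3 is not needed to try it. The helper names and the five-minute margin are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone


def client_kwargs(assumed: dict) -> dict:
    """Map an assume_role response to the keyword arguments boto3.client accepts."""
    creds = assumed["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


def needs_refresh(assumed: dict, now=None, margin=timedelta(minutes=5)) -> bool:
    """True once the credentials are within `margin` of their expiration."""
    now = now or datetime.now(timezone.utc)
    return assumed["Credentials"]["Expiration"] - now <= margin


# Fabricated response in the shape sts.assume_role returns.
issued = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
fake_response = {
    "Credentials": {
        "AccessKeyId": "ASIAEXAMPLE",
        "SecretAccessKey": "secret-placeholder",
        "SessionToken": "token-placeholder",
        "Expiration": issued + timedelta(hours=1),
    }
}

print(sorted(client_kwargs(fake_response)))
print(needs_refresh(fake_response, now=issued))
print(needs_refresh(fake_response, now=issued + timedelta(minutes=56)))
```

With these helpers the client construction becomes `boto3.client('s3', **client_kwargs(assumed))`, and the service calls `assume_role` again whenever `needs_refresh` returns True.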

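6.4 Init Container with different permissions

# Scenario: the init container needs to fetch a secret from Secrets Manager
# The main container needs to access DynamoDB
# Both share the same SA, so the Role must include both sets of permissions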
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-app
  namespace: production
spec:
  selector:
    matchLabels:
      app: secure-app
  template:
    metadata:
      labels:
        app: secure-app
    spec:
      serviceAccountName: secure-app  # the Role includes SecretsManager + DynamoDB permissions
      initContainers:
      - name: secret-fetcher
        image: 123456789.dkr.ecr.ap-southeast-1.amazonaws.com/secret-fetcher:v1
        command: ["sh", "-c"]
        args:
        - |
          aws secretsmanager get-secret-value \
            --secret-id prod/db-credentials \
            --query SecretString --output text > /secrets/db-creds.json
        volumeMounts:
        - name: secrets
          mountPath: /secrets
      containers:
      - name: app
        image: 123456789.dkr.ecr.ap-southeast-1.amazonaws.com/secure-app:v2
        volumeMounts:
        - name: secrets
          mountPath: /secrets
          readOnly: true
      volumes:
      - name: secrets
        emptyDir:
          medium: Memory  # in-memory storage, never written to disk

7. Security best practices

7.1 Principle of least privilege

// Wrong: grants permissions on all of DynamoDB
{
  "Effect": "Allow",
  "Action": "dynamodb:*",
  "Resource": "*"
}

// Right: only the required actions on the specific table
{
  "Effect": "Allow",
  "Action": [
    "dynamodb:GetItem",
    "dynamodb:Query"
  ],
  "Resource": [
    "arn:aws:dynamodb:ap-southeast-1:123456789:table/orders",
    "arn:aws:dynamodb:ap-southeast-1:123456789:table/orders/index/user-id-index"
  ]
}
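
A check like this can be automated before a policy reaches Terraform. The following Python sketch flags wildcard actions and wildcard resources in a policy document; it is a deliberately simple lint with invented function names, not a substitute for a full policy analyzer.

```python
def _as_list(value):
    """IAM allows a single string or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def wildcard_findings(policy: dict) -> list:
    """Return a description of every wildcard action/resource in Allow statements."""
    findings = []
    for i, stmt in enumerate(_as_list(policy["Statement"])):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action", [])):
            if action == "*" or action.endswith(":*"):
                findings.append(f"statement {i}: wildcard action {action}")
        for resource in _as_list(stmt.get("Resource", [])):
            if resource == "*":
                findings.append(f"statement {i}: wildcard resource")
    return findings


too_broad = {"Statement": [
    {"Effect": "Allow", "Action": "dynamodb:*", "Resource": "*"},
]}
scoped = {"Statement": [{
    "Effect": "Allow",
    "Action": ["dynamodb:GetItem", "dynamodb:Query"],
    "Resource": ["arn:aws:dynamodb:ap-southeast-1:123456789:table/orders"],
}]}

print(wildcard_findings(too_broad))
print(wildcard_findings(scoped))
```

Some actions only accept `*` as their resource, so findings from a lint like this are prompts for review rather than automatic failures.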

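7.2 Condition restrictions

// Allow access only from a specific VPC (aws:SourceVpc is only present when the request goes through a VPC endpoint)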
{
  "Effect": "Allow",
  "Action": ["s3:GetObject"],
  "Resource": "arn:aws:s3:::sensitive-data/*",
  "Condition": {
    "StringEquals": {
      "aws:SourceVpc": "vpc-0123456789abcdef0"
    }
  }
}

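// Restrict access to S3 objects under a specific prefix.
// Object actions are scoped by the Resource ARN; the s3:prefix condition key
// applies to ListBucket, so listing gets its own statement.
{
  "Effect": "Allow",
  "Action": ["s3:GetObject", "s3:PutObject"],
  "Resource": "arn:aws:s3:::my-bucket/tenant-a/*"
},
{
  "Effect": "Allow",
  "Action": ["s3:ListBucket"],
  "Resource": "arn:aws:s3:::my-bucket",
  "Condition": {
    "StringLike": {
      "s3:prefix": ["tenant-a/*"]
    }
  }
}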

7.3 Pod Security hardening

# Hardened Deployment for use with IRSA
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-service
  namespace: production
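spec:
  selector:
    matchLabels:
      app: secure-service
  template:
    metadata:
      labels:
        app: secure-service
    spec:
      serviceAccountName: secure-service
      # Do not use the host network
      hostNetwork: false
      # Do not use the host PID namespace
      hostPID: false
      securityContext:
        # Pod-level security context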
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: app
        image: my-app:v1
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop: ["ALL"]
        # A read-only root filesystem needs a writable tmp directory
        volumeMounts:
        - name: tmp
          mountPath: /tmp
      volumes:
      - name: tmp
        emptyDir: {}

7.4 Auditing and monitoring

# Monitor AssumeRoleWithWebIdentity calls through CloudTrail
resource "aws_cloudwatch_log_metric_filter" "irsa_assume_role" {
  name           = "irsa-assume-role-failures"
  pattern        = "{ ($.eventName = \"AssumeRoleWithWebIdentity\") && ($.errorCode = \"*\") }"
  log_group_name = aws_cloudwatch_log_group.cloudtrail.name

  metric_transformation {
    name      = "IRSAAssumeRoleFailures"
    namespace = "Custom/EKS"
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "irsa_failures" {
  alarm_name          = "eks-irsa-assume-role-failures"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "IRSAAssumeRoleFailures"
  namespace           = "Custom/EKS"
  period              = 300
  statistic           = "Sum"
  threshold           = 5
  alarm_description   = "Abnormal number of IRSA AssumeRole failures; possible unauthorized access attempts"
  alarm_actions       = [aws_sns_topic.alerts.arn]
}

8. Troubleshooting guide

8.1 Common troubleshooting flow

# Problem: the Pod reports AccessDenied or NoCredentialProviders

# 1. Confirm the ServiceAccount is associated correctly
kubectl get sa order-service -n production -o yaml
# Check that annotations contains eks.amazonaws.com/role-arn

# 2. Confirm the Pod uses the correct ServiceAccount
kubectl get pod -n production -l app=order-service -o jsonpath='{.items[0].spec.serviceAccountName}'

# 3. Confirm the IRSA environment variables were injected
kubectl exec -it deploy/order-service -n production -- env | grep AWS_ROLE_ARN
kubectl exec -it deploy/order-service -n production -- env | grep AWS_WEB_IDENTITY_TOKEN_FILE

# 4. Confirm the token file exists
kubectl exec -it deploy/order-service -n production -- \
  cat /var/run/secrets/eks.amazonaws.com/serviceaccount/token | head -c 50

# 5. Confirm the Pod's actual identity
kubectl exec -it deploy/order-service -n production -- \
  aws sts get-caller-identity

# 6. If the identity is the Node Role rather than the IRSA Role:
#    - Check that the EKS Mutating Webhook is present (on EKS it is named pod-identity-webhook)
kubectl get mutatingwebhookconfigurations | grep pod-identity
#    - Check that the OIDC Provider is correct
aws eks describe-cluster --name my-cluster --query "cluster.identity.oidc"
#    - Check that the OIDC URL in the IAM Role trust policy matches

# 7. Test AssumeRoleWithWebIdentity manually
TOKEN=$(kubectl exec deploy/order-service -n production -- \
  cat /var/run/secrets/eks.amazonaws.com/serviceaccount/token)
aws sts assume-role-with-web-identity \
  --role-arn arn:aws:iam::123456789:role/my-cluster-production-order-service \
  --role-session-name test \
  --web-identity-token "$TOKEN"
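
When step 7 fails with AccessDenied, the usual cause is a mismatch between the token's claims and the trust policy conditions. The comparison can be done offline, as in this Python sketch; it assumes a trust policy with a single statement and a StringEquals condition, like the one in section 3.3, and the function name and sample values are made up for the example.

```python
def trust_mismatches(trust_policy: dict, claims: dict, issuer: str) -> list:
    """Compare the token's sub/aud claims with the trust policy's StringEquals block."""
    expected = trust_policy["Statement"][0]["Condition"]["StringEquals"]
    problems = []
    for claim in ("sub", "aud"):
        want = expected.get(f"{issuer}:{claim}")
        got = claims.get(claim)
        got_values = got if isinstance(got, list) else [got]  # aud may be a list
        if want is not None and want not in got_values:
            problems.append(f"{claim}: trust policy expects {want!r}, token has {got!r}")
    return problems


# Placeholder issuer; the trust policy was written with the wrong namespace.
ISSUER = "oidc.eks.ap-southeast-1.amazonaws.com/id/EXAMPLE"
trust = {"Statement": [{"Condition": {"StringEquals": {
    f"{ISSUER}:sub": "system:serviceaccount:prod:order-service",
    f"{ISSUER}:aud": "sts.amazonaws.com",
}}}]}
token_claims = {
    "sub": "system:serviceaccount:production:order-service",
    "aud": ["sts.amazonaws.com"],
}

for problem in trust_mismatches(trust, token_claims, ISSUER):
    print(problem)
```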

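8.2 Common errors and fixes

Error message | Cause | Fix
An error occurred (AccessDenied) when calling the AssumeRoleWithWebIdentity | The OIDC URL or SA name in the IAM Role trust policy does not match | Check that the sub condition in the trust policy is system:serviceaccount:namespace:sa-name
NoCredentialProviders | The IRSA environment variables were not injected | Check the SA annotation, the EKS webhook, and whether the Pod sets serviceAccountName
InvalidIdentityToken | The OIDC Provider thumbprint is expired or does not match | Update the OIDC Provider thumbprint
ExpiredTokenException | The token expired (default 24 hours) | Make sure the SDK version supports automatic token refresh; upgrade the AWS SDK
Pod uses the Node Role instead of the IRSA Role | The webhook did not inject, or the SA is misconfigured | Run kubectl describe pod and check whether volumes contains aws-iam-token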

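9. Summary and decision tree

Scenario | Recommended option
Permissions the node itself needs (ECR, CNI, logs) | Node Group IAM Role
Business Pod permissions, EKS >= 1.24 and no Fargate | EKS Pod Identity (simplest)
Business Pod permissions, older EKS version or Fargate in use | IRSA (most mature)
Cross-account access needed | IRSA / Pod Identity + AssumeRole chain
Someone proposes using Access Keys | Refuse, no exceptions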
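
The table can also be read as a small decision function. This Python sketch encodes it directly; the function name, its parameters, and the "node" / "workload" labels are choices made for this example.

```python
def recommend(purpose: str, eks_version=(1, 24), uses_fargate=False, cross_account=False) -> str:
    """Encode the selection table: purpose is 'node' or 'workload'."""
    if purpose == "node":
        return "Node Group IAM Role"
    if purpose != "workload":
        raise ValueError("purpose must be 'node' or 'workload'")
    if eks_version >= (1, 24) and not uses_fargate:
        base = "EKS Pod Identity"
    else:
        base = "IRSA"
    return f"{base} + AssumeRole chain" if cross_account else base


print(recommend("node"))
print(recommend("workload", eks_version=(1, 29)))
print(recommend("workload", eks_version=(1, 29), uses_fargate=True))
print(recommend("workload", eks_version=(1, 21), cross_account=True))
```

Access Keys never appear as a return value, which matches the last row of the table.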

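The core principle: one ServiceAccount per service, and one least-privilege IAM Role per ServiceAccount. This is the foundation of EKS security. The setup takes more work than adding permissions to the Node Group, but you will thank yourself for it during security audits, troubleshooting, and permission revocation.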


© 冰糖葫芦甜 (bthlt.com) 2025 王梓